AMENDMENT E

Dear Sir: 

Further to our conversation of January 16, attached is a proposed set of claim 
amendments. The independent claims 1, 16, and 18 include limitations from independent 
claim 20. 
PROPOSED AMENDMENTS TO THE CLAIMS

This listing of claims will replace all prior versions, and listings, of claims in the
application:

LISTING OF CLAIMS: 

1. (Currently amended) A computer implemented method for identifying output
documents similar to an input document, comprising:

(a) receiving the input document that includes textual content;

(b) performing optical character recognition on the textual content to identify text;

(c) analyzing the text and the textual content to identify keywords, wherein a
predefined number of keywords is identified from a first list of rated keywords extracted
from the input document;

(d) creating a list of best keywords wherein for each keyword remaining in the first
list of keywords performing the steps,

(1) identifying the keyword in one or more domain specific dictionaries of
words and phrases in which they are used;

(2) identifying combinations of keywords in the list of keywords that satisfy the
longest phrase;

(3) determining the frequency of occurrence in the input document of the
identified keywords and phrases identified in the one or more domain specific dictionaries;

(4) setting the linguistic frequency of occurrence of the keywords and phrases
to a predefined value; and

(e) defining a list of best keywords, wherein the list of best keywords has a rating
greater than other keywords in the first list of keywords except for keywords belonging to a
domain specific dictionary of words and having no measurable linguistic frequency;

(f) formulating a query using the list of best keywords;

(g) performing the query to assemble a first set of output documents;

(h) identifying lists of keywords for each output document in the first set of
documents by tokenizing the keywords at one or more predefined word boundaries while
maintaining order of the sequence of the input text and translating the keywords into one or
more languages;

(i) computing a measure of similarity between the input document and each output
document in the first set of documents; and

(j) defining a second set of documents with each document in the first set of
documents for which its computed measure of similarity with the input document is greater
than a predetermined threshold value; wherein the list of best keywords has a maximum
number of keywords less than the number of keywords in the list of best keywords that are
identified as belonging to a domain specific dictionary of words and having no measurable
linguistic frequency, each document in the second set of documents is identified as being
one of a match, a revision, and a relation of the input document, wherein the query is
repeated until a predetermined number of results are obtained or the query is terminated;

(k) if the second set of documents includes a matching document but no similar
documents repeating (a)-(i) using the matching document to identify similar documents,
wherein if one or more documents is related to a copyright registered document, the one or
more documents is rights limited; and

(l) delivering each document in the second set of documents to one or more
predetermined output devices, wherein the collection of documents is set forth in a list
serialized in XML that contains for each document found: its location on a network, original
representation, unformatted representation, service results, metadata, distance
measurement, type of document found according to desired quality, and error status.

2. (Cancelled) 

3. (Currently amended) The method according to claim 1, further comprising
(m) if the second set of documents contains an insufficient number of output
documents, performing query reduction by removing at least one keyword in the list of best
keywords that is not the keyword that is identified as belonging to a domain specific
dictionary and having no measurable linguistic frequency.
4. (Currently amended) The method according to claim 3, further comprising if
after performing (m) the second set of documents contains an insufficient number of output
documents, performing
(n) replacing the list of best keywords [...]
other keywords in the first list of rated keywords; and repeating (b)-(l).

5. (Cancelled) 

6. (Currently amended) The method according to claim 4, performing (n) when
textual content in the input document is identified using OCR or a portion of the input
document matches the output document.

7. (Currently amended) The method according to claim 1, wherein the
predefined number of keywords identified from the first list of rated keywords is five.

8. (Cancelled) 

9. (Original) The method according to claim 1, further comprising:
recording a digital image representation of the input document;
performing OCR on the digital image representation to identify text; and
analyzing the text to identify keywords.

10. (Currently amended) The method according to claim 1, further comprising:

(o) extracting from the input document the first list of keywords;

(p) determining if each keyword in the first list of keywords exists in a domain
specific dictionary of words;

(q) for each keyword in the first list of keywords, determining its frequency of
occurrence in the input document, also referred to as its term frequency;

(r) for each keyword identified at (p) that exists in the domain specific dictionary of
words, assigning each keyword its linguistic frequency if one exists from a database of
linguistic frequencies defined using a collection of documents, and assigning its linguistic
frequency to a predefined small value if one does not exist in the database of linguistic
frequencies;

(s) for each keyword that was not identified in the domain specific dictionary of
words at (h), assigning each keyword its linguistic frequency if one exists in the database of
linguistic frequencies; and

(t) for each keyword in the first list of keywords to which a term frequency and a
linguistic frequency are assigned, computing a rating corresponding to its importance in the
input document that is a function of its frequency of occurrence in the input document and
its frequency of occurrence in the collection of documents.

11. (Currently amended) The method according to claim 10, for each keyword
that was not identified in the domain specific dictionary of words at (p) and that was not
assigned at (s) a linguistic frequency from the database of linguistic frequencies, assigning
each that matches a regular expression from a set of regular expressions a predefined
rating.

12. (Currently amended) The method according to claim 11, further comprising,
for each keyword in the first list of keywords, modifying the term frequency of keywords
determined at (q) to a predefined maximum.

13. (Original) The method according to claim 12, wherein keywords include 
phrases of keywords. 

14. (Original) The method according to claim 11, wherein the rating is a weight
computed using the following equation: $W_{td} = F_{td} \cdot \log(N / F_t)$, where:

$W_{td}$ : the weight of term t in document d;

$F_{td}$ : the frequency occurrence of term t in document d;

$N$ : the number of documents in the collection of documents;

$F_t$ : the document linguistic frequency of term t in the collection of documents.
15. (Original) The method according to claim 11, wherein keywords that do not
match a regular expression from the set of regular expressions are removed from the first
list of keywords.



16. (Currently amended) A computer implemented method for computing ratings
of keywords extracted from an input document, comprising: 

(a) determining if each keyword in the list of keywords exists in a domain specific 
dictionary of words by tokenizing the keywords at one or more predefined word boundaries 
while maintaining order of the sequence of the input text and translating the keywords into 
one or more languages; 

(b) determining a frequency of occurrence in the input document for each keyword in 
the list of keywords, also referred to as its term frequency; 

(c) for each keyword identified at (a) that exists in the domain specific dictionary of 
words, assigning each keyword its linguistic frequency if one exists from a database of 
linguistic frequencies defined using a collection of documents, and assigning its linguistic 
frequency to a predefined small value if one does not exist in the database of linguistic 
frequencies; 

(d) for each keyword that was not identified in the domain specific dictionary of 
words at (a), assigning each keyword its linguistic frequency if one exists in the database of 
linguistic frequencies; and 

(e) for each keyword in the list of keywords to which a term frequency and a 
linguistic frequency are assigned, computing a rating corresponding to its importance in the 
input document that is a function of its frequency of occurrence in the input document and 
its frequency of occurrence in the collection of documents, wherein a query reduction is 
performed by removing at least one keyword in the list of best keywords that is identified as 
belonging to a domain specific dictionary and having no measurable linguistic frequency if 
an insufficient number of results are obtained from the list of keywords, wherein the query 
is repeated until a predetermined number of results are obtained or the query is 
terminated;

(f) defining a list of best keywords wherein the list of best keywords has a rating
greater than other keywords in the list of keywords except for keywords belonging to a
domain specific dictionary of words and having no measurable linguistic frequency by
tokenizing the keywords at one or more predefined word boundaries while maintaining
order of the sequence of the input text and translating the keywords into one or more
languages;

[Deleted: (f) if the second set of documents includes a matching document but no
similar documents repeating (a)-(f) using the matching document to identify similar
documents]

(g) formulating a query using the list of best keywords;

(h) performing the query to assemble a first set of output documents;

(i) identifying lists of keywords for each output document in the first set of 
documents; 

(j) computing a measure of similarity between the input document and each output
document in the first set of documents; 

(k) defining a second set of documents with each document in the first set of 
documents for which its computed measure of similarity with the input document is greater 
than a predetermined threshold value; wherein the list of best keywords has a maximum 
number of keywords less than the number of keywords in the list of best keywords that are 
identified as belonging to a domain specific dictionary of words and having no measurable 
linguistic frequency, each document in the second set of documents is identified as being
one of a match, a revision, and a relation of the input document; and 

(l) delivering each document in the collection of documents to a predetermined
output device, wherein the collection of documents is set forth in a list serialized in XML 
that contains for each document found: its location on a network, original representation, 
unformatted representation, service results, metadata, distance measurement, type of 
document found according to desired quality, and error status. 

17. (Original) The method according to claim 16, wherein the keywords in the list
of keywords are used to carry out one of language identification, indexing, categorization, 
clustering, searching, translating, storing, duplicate detection, and filtering. 

18. (Currently amended) A computer implemented system for identifying output 
documents similar to an input document, comprising: a memory for storing the output 
documents and the input document and processing instructions of the system; and a 
processor coupled to the memory for executing the processing instructions of the system;
the processor in executing the processing instructions: 

(a) identifying a predefined number of keywords from a first list of rated keywords
extracted from the input document;

(b) creating a list of best keywords wherein for each keyword remaining in the first
list of keywords performing the steps,

(1) identifying the keyword in one or more domain specific dictionaries of
words and phrases in which they are used;

(2) identifying combinations of keywords in the list of keywords that satisfy the
longest phrase;

(3) determining the frequency of occurrence in the input document of the
identified keywords and phrases identified in the one or more domain specific dictionaries;

(4) setting the linguistic frequency of occurrence of the keywords and phrases
to a predefined value; and

(c) defining a list of best keywords wherein the list of best keywords has a rating
greater than other keywords in the first list of keywords except for keywords belonging to a
domain specific dictionary of words and having no measurable linguistic frequency by
tokenizing the keywords at one or more predefined word boundaries while maintaining
order of the sequence of the input text and translating the keywords into one or more
languages;

(d) formulating a query using the list of best keywords;

(e) performing the query to assemble a first set of output documents;

(f) identifying lists of keywords for each output document in the first set of
documents;

(g) computing a measure of similarity between the input document and each output
document in the first set of documents;

(h) defining a second set of documents with each document in the first set of
documents for which its computed measure of similarity with the input document is greater
than a predetermined threshold value; wherein the list of best keywords has a maximum
number of keywords less than the number of keywords in the list of best keywords that are
identified as belonging to a domain specific dictionary of words and having no measurable
linguistic frequency; and

(i) if the second set of documents contains an insufficient number of output
documents, performing query reduction by removing at least one keyword in the list of best
keywords that is not the keyword that is identified as belonging to a domain specific
dictionary and having no measurable linguistic frequency, wherein the query is repeated
until a predetermined number of results are obtained or the query is terminated;

(j) if the second set of documents includes a matching document but no similar
documents repeating (a)-(i) using the matching document to identify similar documents;
and

(k) delivering each document in the second set of documents to a predetermined 
output device, wherein the collection of documents is set forth in a list serialized in XML 
that contains for each document found: its location on a network, original representation, 
unformatted representation, service results, metadata, distance measurement, type of 
document found according to desired quality, and error status. 
19. (Currently amended) The system according to claim 18, wherein the
processor in executing the processing instructions further comprises:

(l) extracting from the input document the first list of keywords;

(m) determining if each keyword in the first list of keywords exists in a domain
specific dictionary of words;

(n) for each keyword in the first list of keywords, means for determining its frequency
of occurrence in the input document, also referred to as its term frequency;

(o) for each keyword identified at (m) that exists in the domain specific dictionary of
words, means for assigning each keyword its linguistic frequency if one exists from a
database of linguistic frequencies defined using a collection of documents, and assigning
its linguistic frequency to a predefined small value if one does not exist in the database of
linguistic frequencies;

(p) for each keyword that was not identified in the domain specific dictionary of
words at (m), means for assigning each keyword its linguistic frequency if one exists in the
database of linguistic frequencies; and

(q) for each keyword in the first list of keywords to which a term frequency and a
linguistic frequency are assigned, means for computing a rating corresponding to its
importance in the input document that is a function of its frequency of occurrence in the
input document and its frequency of occurrence in the collection of documents.
20. (Currently amended) An article of manufacture for identifying output
documents similar to an input document, the article of manufacture comprising computer
usable storage media including computer readable instructions embedded therein that
cause a computer to perform a method, wherein the method comprises:

(a) identifying a predefined number of keywords from a first list of rated keywords 
extracted from the input document to define a list of best keywords; the list of best 
keywords having a rating greater than other keywords in the first list of keywords except for 
keywords belonging to a domain specific dictionary of words and having no measurable 
linguistic frequency, wherein the keywords are tokenized at one or more predefined word 
boundaries while maintaining order of the sequence of the input text and translating the 
keywords into one or more languages; 

(b) creating a list of best keywords wherein for each keyword remaining in the first
list of keywords performing the steps,

(1) identifying the keyword in one or more domain specific dictionaries of
words and phrases in which they are used;

(2) identifying combinations of keywords in the list of keywords that satisfy the
longest phrase;

(3) determining the frequency of occurrence in the input document of the
identified keywords and phrases identified in the one or more domain specific dictionaries;

(4) setting the linguistic frequency of occurrence of the keywords and phrases
to a predefined value; and

(c) defining a list of best keywords wherein the list of best keywords has a rating
greater than other keywords in the first list of keywords except for keywords belonging to a
domain specific dictionary of words and having no measurable linguistic frequency by
tokenizing the keywords at one or more predefined word boundaries while maintaining
order of the sequence of the input text and translating the keywords into one or more
languages;

(d) formulating a query using the list of best keywords;

(e) performing the query to assemble a first set of output documents;

(f) identifying lists of keywords for each output document in the first set of
documents;

(g) computing a measure of similarity between the input document and each output
document in the first set of documents;

(h) defining a second set of documents with each document in the first set of
documents for which its computed measure of similarity with the input document is greater
than a predetermined threshold value; wherein the list of best keywords has a maximum
number of keywords less than the number of keywords in the list of best keywords that are
identified as belonging to a domain specific dictionary of words and having no measurable
linguistic frequency, each document in the second set of documents is identified as being
one of a match, a revision, and a relation of the input document; and

(i) if the second set of documents contains an insufficient number of output
documents, performing query reduction by removing at least one keyword in the list of best
keywords that is not the keyword that is identified as belonging to a domain specific
dictionary and having no measurable linguistic frequency, wherein the query is repeated
until a predetermined number of results are obtained or the query is terminated;

(j) if the second set of documents includes a matching document but no similar
documents repeating (a)-(j) using the matching document to identify similar documents;
and

(k) delivering each document in the second set of documents to a predetermined 
output device, wherein the collection of documents is set forth in a list serialized in XML 
that contains for each document found: its location on a network, original representation, 
unformatted representation, service results, metadata, distance measurement, type of 
document found according to desired quality, and error status. 

21. (Currently amended) The system according to claim 18, further comprising if
after performing (i) the second set of documents contains an insufficient number of output
documents, performing:

(l) replacing the list of best keywords [...]
other keywords in the first list of rated keywords; and repeating (b)-(k).

22. (Currently amended) The system according to claim 18, wherein for each
keyword that was not identified in the domain specific dictionary of words at (h) and that
was not assigned at a linguistic frequency from the database of linguistic frequencies,
assigning each that matches a regular expression from a set of regular expressions a
predefined rating, wherein the rating is a weight computed using the following equation:
$W_{td} = F_{td} \cdot \log(N / F_t)$, where:

$W_{td}$ : the weight of term t in document d;

$F_{td}$ : the frequency occurrence of term t in document d;

$N$ : the number of documents in the collection of documents;

$F_t$ : the document linguistic frequency of term t in the collection of documents.
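
For illustration only, and forming no part of the claims, the weight recited in claims 14 and 22 and the linguistic-frequency assignment recited in claim 16, steps (b)-(e), might be sketched as follows. The function name, the parameter names, and the value chosen for the "predefined small value" are assumptions made for this sketch; the claims do not specify them. Keywords that have neither a dictionary entry nor a linguistic frequency are simply left unrated here, whereas claims 11 and 22 rate them through regular expressions.

```python
import math
from collections import Counter

# Assumption: the claims say only "a predefined small value"; 1.0 is illustrative.
SMALL_LINGUISTIC_FREQUENCY = 1.0


def rate_keywords(keywords, domain_dictionary, linguistic_frequencies, collection_size):
    """Return {keyword: W_td}, where W_td = F_td * log(N / F_t).

    keywords               -- keywords extracted from the input document, in order
    domain_dictionary      -- set of domain specific words
    linguistic_frequencies -- mapping of term to F_t in the collection of documents
    collection_size        -- N, the number of documents in the collection
    """
    term_frequencies = Counter(keywords)  # F_td for each distinct keyword
    ratings = {}
    for term, f_td in term_frequencies.items():
        f_t = linguistic_frequencies.get(term)
        if f_t is None:
            if term not in domain_dictionary:
                # No linguistic frequency and not a dictionary word:
                # left unrated in this sketch.
                continue
            # Dictionary word with no known linguistic frequency.
            f_t = SMALL_LINGUISTIC_FREQUENCY
        ratings[term] = f_td * math.log(collection_size / f_t)
    return ratings
```

Under these assumptions, a domain specific word that is absent from the database of linguistic frequencies receives a large weight because its assumed linguistic frequency is small, which is consistent with the claims' separate treatment of such keywords when the list of best keywords is defined.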
CONCLUSION

For the reasons detailed above, it is submitted that all claims remaining in the application
(Claims 1, 3-4, 6-7 and 9-22) are now in condition for allowance. The foregoing comments
do not require unnecessary additional search or examination. 

No additional fee is believed to be required for this Amendment. However, the 
undersigned attorney of record hereby authorizes the charging of any necessary fees, 
other than the issue fee, to Xerox Deposit Account No. 24-0037. 

In the event the Examiner considers personal contact advantageous to the
disposition of this case, he/she is hereby authorized to call Mark Svat, at Telephone
Number (216) 861-5582.


Respectfully submitted, 
FAY SHARPE LLP 


Date 


Mark Svat, Reg. No. 34,261 
Kevin M. Dunn, Reg. No. 52,842 
1100 Superior Avenue, Seventh Floor 
Cleveland, OH 44114-2579 